Abstract: This paper investigates energy efficiency for two-tier
femtocell networks by combining game theory and
stochastic learning. With the Stackelberg game formulation, a 
hierarchical reinforcement learning framework is applied to 
study the joint average utility maximization of macrocells and 
femtocells subject to the minimum signal-to-interference-plus- 
noise-ratio requirements. The macrocells behave as the leaders 
and the femtocells are followers during the learning procedure. 
At each time step, the leaders commit to dynamic strategies 
based on the best responses of the followers, while the followers 
compete against each other with no further information but the 
leaders' strategy information. In this paper, we propose two
learning algorithms to schedule each cell's stochastic power levels,
led by the macrocells. Numerical experiments are presented
to validate the proposed studies and show that the two learning 
algorithms substantially improve the energy efficiency of the 
femtocell networks. 

I. Introduction 

Rapidly rising energy consumption and greater awareness of
environmental protection have created an urgent need for
energy-efficient wireless communication techniques. In wireless
cellular networks, the radio access part is the main source of
energy consumption, accounting for upward of 70% of the total [1].
It is therefore crucial to increase the energy efficiency of
wireless access networks to meet the challenges raised by the
exponential growth in mobile services and energy consumption.

Recently, femtocells have been considered a promising technique
for improving the indoor experience of mobile users. Owing to the
short transmit-receive distance, femtocells can greatly lower
transmit power, prolong handset battery life, and improve
reception [2]. Although femtocells bring significant benefits to
both mobile operators and customers, many technical challenges
must still be solved to make deployment feasible, e.g.,
cross-tier/co-tier interference management. In [3], the authors
proposed a distributed utility-based
signal-to-interference-plus-noise ratio (SINR) adaptation
algorithm to alleviate the cross-tier interference at the
macrocell from co-channel femtocells. In [4], a Stackelberg game
was formulated to study resource allocation in two-tier femtocell
networks, where the macro base station (MBS) protects itself by
pricing the interference from femtocell users (FUs).

This paper discusses energy efficiency in femtocell networks. In
[5], the problem of energy-efficient spectrum sharing and power
allocation in cognitive radio femtocells was studied, where a
three-stage Stackelberg game model was formulated to improve the
energy efficiency. In [6], the authors proposed a novel energy
saving procedure for the femtocell base station (FBS) to decide
when to switch on/off. Moreover, we focus mainly on the co-channel
operation of femtocells with closed access. The interference in
this scenario cannot be handled by centralized network scheduling,
because the number and locations of the femtocells are unknown. In
such a networking environment, the femtocells are most likely to
be autonomous. Reinforcement learning (RL) [7] has already been
applied to femtocell networks. In [8], a real-time multi-agent RL
algorithm was proposed that optimizes network performance by
managing the interference in femtocell networks. In [9], a
distributed learning scheme based on Q-learning was developed to
manage the femto-to-macrocell cross-tier interference in femtocell
networks.

In this paper, we model the energy efficiency aspect of the power
allocation problem in femtocell networks as a Stackelberg learning
game, or leader-follower learning process, with the following
characteristics: 1) the macrocells are the leaders, whereas the
femtocells are the followers; 2) the leaders act knowing the
response of the femtocells to their own strategy decisions; 3)
given the leaders' decisions, the followers compete with each
other. Learning is accomplished by directly interacting with the
surrounding environment and properly adjusting the strategies
according to the realizations of the achieved performance. The
solution of such a learning game is the Stackelberg equilibrium
(SE). If no hierarchy exists during the learning procedure, the
Stackelberg learning game reduces to the non-cooperative learning
game, which is the scenario discussed in [10].

The rest of this paper is organized as follows. The next section
presents the energy efficiency problem in femtocell networks.
Section III defines a Stackelberg game theoretic solution for the
users' hierarchical behaviors. In Section IV, a Stackelberg
learning framework is proposed, and two RL based algorithms are
derived. Numerical results are presented in Section V, verifying
the validity and efficiency of the proposed algorithms. Finally,
Section VI concludes the paper.

Fig. 1. Macrocell underlaid with femtocells located in the coverage
of the macrocell (MBS: macro base station; MU: macrocell user; FBS:
femto base station; FU: femtocell user).

II. Problem Formulation 

The network scenario considered in this paper is shown in Fig. 1,
where there exist multiple femtocells and macrocells. Each
macrocell, consisting of an MBS and multiple macro users (MUs), is
underlaid with several co-channel femtocells. In each femtocell,
one FBS provides service to the femtocell users (FUs). We assume
the closed access policy, since customers may prefer it because of
privacy concerns and limited backhaul bandwidth. Assuming the same
distribution of femtocells in each macrocell, we concentrate on
the case of one macrocell without loss of generality. Suppose $N$
femtocells $B_i$ ($i \geq 1$) operate within the coverage of the
macrocell $B_0$. Users of the same macrocell/femtocell adopt time
division multiple access (TDMA) for data transmission and thus
cause no interference to each other. In the following, this paper
mainly addresses the uplink transmissions in the cell over a
common spectrum band.

Let $i \in \mathcal{N}$ denote the scheduled user connected to its
BS $B_i$, where $\mathcal{N} = \{0, 1, \ldots, N\}$ refers to the
user index set. With the power level of user $i$ denoted by
$p_i \in \mathcal{P}_i = [p_i^{\min}, p_i^{\max}]$, the SINR
$\gamma_i$ of user $i$ at $B_i$ is given by

$$\gamma_i(p_i, \mathbf{p}_{-i}) = \frac{h_{i,i}\, p_i}{\sum_{j \in \mathcal{N} \setminus \{i\}} h_{j,i}\, p_j + \sigma^2}.$$

Here $-i$ denotes all the users in the set $\mathcal{N}$ except
user $i$, $\sigma^2$ is the variance of the background additive
white Gaussian noise (AWGN), and $\{h_{j,i}\}$ is the set of
channel gains from user $j$ to $B_i$. Each user aims to improve its
energy efficiency, which can be defined as

$$\xi_i(p_i, \mathbf{p}_{-i}) = \frac{W \log_2\left(1 + \gamma_i(p_i, \mathbf{p}_{-i})\right)}{p_i + p_i^{a}},$$

where $W$ is the spectrum bandwidth, and $p_i^{a}$ denotes the
additional circuit power consumption of user $i$ during
transmissions, which is independent of the transmission power.
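The two definitions above can be illustrated with a minimal Python sketch. It assumes the usual SINR form (received power over interference plus noise) and linear units (watts, Hz); the function names and the nested-list layout `gains[j][i]` for the gain from user j to base station i are illustrative choices, not taken from the paper.

```python
import math

def sinr(i, powers, gains, noise_var):
    """SINR of user i at its own base station B_i.

    powers[j]   : transmit power of user j (watts)
    gains[j][i] : channel gain from user j to base station B_i (assumed layout)
    noise_var   : AWGN variance sigma^2 (watts)
    """
    interference = sum(gains[j][i] * powers[j]
                       for j in range(len(powers)) if j != i)
    return gains[i][i] * powers[i] / (interference + noise_var)

def energy_efficiency(i, powers, gains, noise_var, bandwidth, circuit_power):
    """Achievable rate divided by total consumed power (bit/s per watt)."""
    rate = bandwidth * math.log2(1.0 + sinr(i, powers, gains, noise_var))
    return rate / (powers[i] + circuit_power)
```

Raising `powers[i]` increases the rate only logarithmically while the denominator grows linearly, which is why the energy-efficient power level is generally not the maximum one.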

Given the minimum SINR requirement $\gamma_i^*$ chosen to ensure
adequate QoS, we can express the utility function of user $i$
formally as

$$u_i(p_i, \mathbf{p}_{-i}) = \begin{cases} \xi_i(p_i, \mathbf{p}_{-i}), & \text{if } \gamma_i(p_i, \mathbf{p}_{-i}) \geq \gamma_i^*; \\ 0, & \text{otherwise.} \end{cases}$$

Whenever the macrocell's $\gamma_0^*$ becomes infeasible, the MU
transmits at its maximum power level. We propose that the MBS then
requests the FUs to reduce their minimum SINR targets. Each user
$i \in \mathcal{N}$ tries to find the optimal power level that
maximizes its utility,

$$\max_{p_i \in \mathcal{P}_i} u_i(p_i, \mathbf{p}_{-i}). \tag{1}$$
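As a rough illustration of problem (1), the sketch below evaluates the thresholded utility and searches a finite list of candidate power levels for one user while the others are held fixed. The grid search, the function names, and the keyword names are assumptions made for this example; the paper does not prescribe a solver at this point.

```python
import math

def utility(i, powers, gains, noise_var, bandwidth, circuit_power, sinr_min):
    """Energy efficiency of user i, or zero if its SINR target is missed."""
    interference = sum(gains[j][i] * powers[j]
                       for j in range(len(powers)) if j != i)
    sinr = gains[i][i] * powers[i] / (interference + noise_var)
    if sinr < sinr_min:
        return 0.0
    return bandwidth * math.log2(1.0 + sinr) / (powers[i] + circuit_power)

def best_power(i, powers, candidates, **kw):
    """Candidate power level maximising user i's utility, others fixed."""
    def value(p):
        trial = list(powers)
        trial[i] = p
        return utility(i, trial, **kw)
    return max(candidates, key=value)
```

With a loose SINR target the lowest candidate tends to win on efficiency; a tighter target can rule it out and push the choice to a higher level.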



III. Stackelberg Game Theoretic Solution 



Within the Stackelberg game framework, the MU is modeled as the
leader, while the FUs are the followers competing with each other
over the spectrum. In order to investigate the existence of an SE,
we first define $p_i^*$ to be the best response to
$\mathbf{p}_{-i}$ if

$$u_i(p_i^*, \mathbf{p}_{-i}) \geq u_i(p_i, \mathbf{p}_{-i}), \quad \forall p_i \in \mathcal{P}_i.$$

User $i$'s best response to $\mathbf{p}_{-i}$ is denoted as
$BR_i(\mathbf{p}_{-i})$. Let $NE(p_0)$ be the Nash equilibrium (NE)
strategy of the FUs if the MU chooses to play $p_0$, i.e.,

$$NE(p_0) = \mathbf{p}_{-0}, \quad \text{if } p_i = BR_i(\mathbf{p}_{-i}),\ \forall i \in \mathcal{N} \setminus \{0\}.$$

The strategy profile $(p_0^*, NE(p_0^*))$ is an SE if and only if

$$u_0(p_0^*, NE(p_0^*)) \geq u_0(p_0, NE(p_0)), \quad \forall p_0 \in \mathcal{P}_0.$$

The following theorem establishes the existence of the SE. 

Theorem 1. The SE always exists in our defined Stackelberg 
game with the MU leading and the FUs following. 

Proof: For every $p_0 \in \mathcal{P}_0$, there is at least one NE
in the non-cooperative game
$G = (p_0, \mathcal{N} \setminus \{0\}, \{\mathcal{P}_i\}, \{u_i\})$,
since for every $i \in \mathcal{N} \setminus \{0\}$

1) the strategy set $\mathcal{P}_i$ is a non-empty, convex, and
compact subset of some Euclidean space $\mathbb{R}^{1}$;

2) $u_i$ is continuous in $\mathbf{p}_{-i}$ and quasi-concave in $p_i$.

Therefore, there exists $p_0^* \in \mathcal{P}_0$ such that

$$u_0(p_0^*, NE(p_0^*)) = \sup_{p_0 \in \mathcal{P}_0} u_0(p_0, NE(p_0)).$$

We can conclude that the SE always exists in the defined
Stackelberg game. ■

Note that there is only one leader in the Stackelberg game. The MU
regards itself as the leader and plays the Stackelberg strategy,
and the FUs play their best responses until the equilibrium is
reached.
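The leader-follower logic can be made concrete with a brute-force sketch over a small finite set of power levels. It is only an illustration under stated assumptions: the power sets are discretised, only pure-strategy follower equilibria are enumerated (none may exist in general), and when several exist the leader is assumed to get the one it prefers. All function and parameter names are hypothetical.

```python
import math
from itertools import product

def utility(i, p, gains, noise, W, pc, gmin):
    """Thresholded energy efficiency of user i for the power vector p."""
    interf = sum(gains[j][i] * p[j] for j in range(len(p)) if j != i)
    g = gains[i][i] * p[i] / (interf + noise)
    return W * math.log2(1.0 + g) / (p[i] + pc) if g >= gmin[i] else 0.0

def follower_equilibria(p0, levels, n_followers, params):
    """All pure-strategy NE of the follower game for a fixed leader power p0."""
    found = []
    for combo in product(levels, repeat=n_followers):
        p = [p0, *combo]

        def can_improve(i):
            current = utility(i, p, **params)
            return any(utility(i, p[:i] + [q] + p[i + 1:], **params) > current + 1e-12
                       for q in levels)

        if not any(can_improve(i) for i in range(1, n_followers + 1)):
            found.append(combo)
    return found

def stackelberg(levels, n_followers, params):
    """Leader power, follower NE, and leader utility at the (optimistic) SE."""
    best = None
    for p0 in levels:
        for ne in follower_equilibria(p0, levels, n_followers, params):
            u0 = utility(0, [p0, *ne], **params)
            if best is None or u0 > best[2]:
                best = (p0, ne, u0)
    return best
```

The enumeration grows exponentially with the number of followers, which is one motivation for the learning-based approach of the next section.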



IV. Stackelberg Learning Framework 

In the Stackelberg learning game, each user in the network behaves
as an intelligent agent whose objective is to maximize its expected
utility. The game is played repeatedly to reach the optimal
strategies. The Stackelberg learning framework has two levels of
hierarchy: 1) the MU aims to maximize its expected utility knowing
the responses of all FUs to each of its possible plays; 2) given
the strategy of the MU, the FUs play a non-cooperative game among
themselves. In this section, we focus on how to reach the optimal
communication configuration through an RL approach [7].

A strategy for user $i \in \mathcal{N}$ at time step $t$ is defined
to be a probability vector
$\mathbf{y}_i^t = (y_{i,1}^t, \ldots, y_{i,m_i}^t) \in \mathcal{Y}_i$,
where $y_{i,j_i}^t$ is the probability with which user $i$ chooses
action (transmission power) $p_{i,j_i} \in \mathcal{P}_i$, and
$\mathcal{Y}_i$ is the strategy set available to user $i$. Since
each user can only choose a power level from a finite discrete set,
$\mathcal{P}_i$ is assumed to be a finite set with $m_i$ elements.
The expected utility of user $i$ at time slot $t$ can then be
expressed as

$$U_i(\mathbf{y}_i^t, \mathbf{y}_{-i}^t) = E\left[u_i \mid \text{user } j \text{ plays strategy } \mathbf{y}_j^t,\ j \in \mathcal{N}\right] = \sum_{\mathbf{p}^t \in \mathcal{P}} u_i(\mathbf{p}^t) \prod_{j \in \mathcal{N}} y_{j,j_j}^t,$$

where $\mathbf{p}^t = (p_{0,j_0}^t, \ldots, p_{N,j_N}^t)$ is the
vector of actions chosen by all users at time $t$, and
$\mathcal{P} = \times_{i \in \mathcal{N}} \mathcal{P}_i$ is the set
of all action vectors.
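The expectation over joint actions can be computed directly when the action sets are small. The following sketch enumerates every action vector and weights the pure-action utility by the product of the individual probabilities; `payoff` is a placeholder callable standing in for $u_i$, and the names are illustrative.

```python
from itertools import product

def expected_utility(i, strategies, levels, payoff):
    """Expected utility of user i under independent mixed strategies.

    strategies[j][k] : probability that user j plays its k-th power level
    levels[j][k]     : the k-th power level of user j
    payoff(i, p)     : utility of user i for the pure action vector p
    """
    total = 0.0
    for joint in product(*(range(len(lv)) for lv in levels)):
        prob = 1.0
        for j, k in enumerate(joint):
            prob *= strategies[j][k]
        powers = [levels[j][k] for j, k in enumerate(joint)]
        total += prob * payoff(i, powers)
    return total
```

The cost is the product of all action-set sizes, so this exact evaluation is only practical for a handful of users and levels.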

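For any stationary strategy* of the MU
$\mathbf{y}_0 \in \mathcal{Y}_0$, the best-response strategies of
all FUs define an NE strategy $NE(\mathbf{y}_0)$, i.e.,
$NE(\mathbf{y}_0) = \mathbf{y}_{-0}^*$, if

$$\mathbf{y}_i^* = \arg\max_{\mathbf{y}_i \in \mathcal{Y}_i} U_i(\mathbf{y}_i, \mathbf{y}_{-i}^*), \quad \forall i \in \mathcal{N} \setminus \{0\}.$$

The MU's optimal strategy is then

$$\mathbf{y}_0^* = \arg\max_{\mathbf{y}_0 \in \mathcal{Y}_0} U_0(\mathbf{y}_0, NE(\mathbf{y}_0)).$$

Together, $(\mathbf{y}_0^*, NE(\mathbf{y}_0^*))$ constitute a
stationary strategy of SE for the Stackelberg learning formulation.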

Theorem 2. For the considered Stackelberg learning game, there
exist a stationary strategy of the MU and an NE strategy of the FUs
that form an SE.

Inspired by [11], we can prove Theorem 2 as follows.

Proof: If the MU follows a stationary strategy $\mathbf{y}_0$, then
the Stackelberg learning game reduces to an $N$-player stochastic
learning game. It has been shown in [12] that every finite
strategic-form game has a mixed strategy equilibrium; that is,
there always exists an $NE(\mathbf{y}_0)$ in our formulation of the
discrete power allocation game given the MU's strategy
$\mathbf{y}_0$. The rest of the proof follows directly from the
definition of the SE and is thus omitted for brevity. ■

*A strategy is said to be stationary if
$\mathbf{y}_i = (y_{i,1}, \ldots, y_{i,m_i})$ does not change with
time during the stochastic learning process.



A. Reinforcement Learning based Algorithm-I (RLA-I) 

During the Stackelberg learning process, the MU behaves as the
leader and knows the strategy information of all FUs. Users with RL
ability learn to maximize their individual expected utilities
through repeated interactions with the networking environment.
Among the many possible implementations of the above adaptation
mechanism, we consider the so-called Q-learning [13], where the
users' strategies are parameterized through Q-functions that
characterize the relative expected utility of a particular power
level. More specifically, let $Q_i^t(p_{i,j_i})$ denote the Q-value
of user $i$'s power level $p_{i,j_i}$ at time $t$. Then, after
selecting power level $p_{i,j_i}$ at time $t+1$, the Q-value is
updated according to

$$Q_i^{t+1}(p_{i,j_i}) = Q_i^t(p_{i,j_i}) + \alpha\left(U_i(p_{i,j_i}, \mathbf{y}_{-i}^{t+1}) - Q_i^t(p_{i,j_i})\right), \tag{2}$$

where $\alpha \in [0, 1)$ is the learning rate, and

$$U_i(p_{i,j_i}, \mathbf{y}_{-i}^{t+1}) = \sum_{\mathbf{p}_{-i}^{t+1} \in \mathcal{P}_{-i}} u_i(p_{i,j_i}, \mathbf{p}_{-i}^{t+1}) \prod_{j \in \mathcal{N} \setminus \{i\}} y_{j,j_j}^{t+1}. \tag{3}$$

Each FU $i \in \mathcal{N} \setminus \{0\}$ does not know the other
competing FUs' private strategies; the only information it has is
the MU's transmission parameters. However, FU $i$ is able to
compute the attainable utility
$u_i(p_{i,j_i}, \mathbf{p}_{-i}^{t+1})$ from the feedback
information of its receiver. Therefore, FU $i$ can estimate
$U_i(p_{i,j_i}, \mathbf{y}_{-i}^{t+1})$ by $U_i^{t+1}(p_{i,j_i})$
using Eq. (4), where $U_i^{t+1}(p_{i,j_i}, p_{0,j_0})$ is given in
Eq. (5) and $n^t(p_{i,j_i}, p_{0,j_0})$ is the number of times the
MU has selected power level $p_{0,j_0}$ while FU $i$ selected power
level $p_{i,j_i}$ until time slot $t$. The updating rule in Eq. (2)
can then be rewritten as

$$Q_i^{t+1}(p_{i,j_i}) = \begin{cases} Q_i^t(p_{i,j_i}) + \alpha\left(U_i(p_{i,j_i}, \mathbf{y}_{-i}^{t+1}) - Q_i^t(p_{i,j_i})\right), & \text{if } i = 0; \\ Q_i^t(p_{i,j_i}) + \alpha\left(U_i^{t+1}(p_{i,j_i}) - Q_i^t(p_{i,j_i})\right), & \text{otherwise.} \end{cases} \tag{6}$$
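In code, the update is a one-line exponential smoothing step. The sketch below also includes an incremental sample mean as one plausible way for an FU to track the utility observed for a given pair of power levels from visit counts; that helper is an assumption for illustration, and both function names are hypothetical.

```python
def q_update(q_value, utility_estimate, alpha):
    """Move the Q-value a fraction alpha towards the current utility estimate."""
    return q_value + alpha * (utility_estimate - q_value)

def running_mean(mean, count, sample):
    """Incremental sample mean after observing one more sample.

    mean   : average of the first `count` samples
    count  : number of samples seen so far
    sample : newly observed utility
    """
    return mean + (sample - mean) / (count + 1)
```

With a constant target, repeated calls to `q_update` shrink the remaining gap by a factor of `1 - alpha` per step, so smaller learning rates give smoother but slower tracking.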

Next, we need to specify how each user selects its power level.
Greedy selection, in which the action with the highest Q-value is
always chosen, may lead to a globally suboptimal solution. Thus, we
need to incorporate some way of exploring less-optimal strategies.
Here, we focus on the Boltzmann action selection mechanism, that
is, the probability of choosing power level $p_{i,j_i}$ at time
$t+1$ is given by

$$y_{i,j_i}^{t+1} = \frac{\exp\left(Q_i^t(p_{i,j_i})/\tau_i\right)}{\sum_{l=1}^{m_i} \exp\left(Q_i^t(p_{i,l})/\tau_i\right)}, \tag{7}$$

where the temperature $\tau_i$ controls the
exploration/exploitation tradeoff.
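A minimal sketch of Boltzmann selection follows. Subtracting the largest Q-value before exponentiating is a standard numerical-stability step that leaves the probabilities unchanged; it is an implementation detail added here, as are the function names.

```python
import math
import random

def boltzmann(q_values, temperature):
    """Selection probabilities proportional to exp(Q / temperature)."""
    top = max(q_values)
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def sample_action(probabilities, rng=random):
    """Draw the index of a power level according to the given probabilities."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]
```

A high temperature makes the distribution nearly uniform (exploration), while a temperature close to zero concentrates the mass on the highest Q-value (exploitation).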

Now we present the reinforcement learning based algorithm- 
I for the Stackelberg learning game. 



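$$U_i^{t+1}(p_{i,j_i}, p_{0,j_0}) = \begin{cases} \dfrac{n^t(p_{i,j_i}, p_{0,j_0})\, U_i^{t}(p_{i,j_i}, p_{0,j_0}) + u_i(\mathbf{p}^{t+1})}{n^t(p_{i,j_i}, p_{0,j_0}) + 1}, & \text{if } (p_i^{t+1}, p_0^{t+1}) = (p_{i,j_i}, p_{0,j_0}); \\ U_i^{t}(p_{i,j_i}, p_{0,j_0}), & \text{otherwise.} \end{cases} \tag{5}$$

RLA-I

Initialization:

$t = 0$; initialize the Q-values $Q_i^t(p_{i,j_i})$ for all
$i \in \mathcal{N}$ and all $p_{i,j_i} \in \mathcal{P}_i$.

Learning:

1) The MU chooses action $p_{0,j_0}$ according to $\mathbf{y}_0^t$
and broadcasts its strategy information to all the FUs in the
network.

2) Each FU $i \in \mathcal{N} \setminus \{0\}$ selects an action
according to $\mathbf{y}_i^t$ and sends its relevant strategy
information to the MBS.

3) Users measure their SINR $\gamma_i$ with the feedback
information of the intended receiver. If
$\gamma_i \geq \gamma_i^*$, then $\xi_i(\mathbf{p})$ is achieved;
otherwise, the receiver cannot decode correctly and the user
obtains zero utility.

4) The MU updates $U_i(p_{i,j_i}, \mathbf{y}_{-i}^{t})$ according
to Eq. (3), and the FUs update $U_i(p_{i,j_i})$ based on Eq. (4)
and Eq. (5).

5) Update the Q-values $Q_i^t(p_{i,j_i})$ according to Eq. (6).

6) Update the strategies $\mathbf{y}_i^{t+1}$ according to Eq. (7).

7) $t = t + 1$.
End Learning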
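The loop structure of the algorithm can be sketched as follows. This is a deliberately simplified variant, not the algorithm as specified: every user, leader included, updates the Q-value of the level it just played from the realised utility sample, which replaces the expected-utility bookkeeping of steps 4 and 5. The optimistic initial value `q_init`, which encourages every level to be tried, and all names and defaults are assumptions.

```python
import math
import random

def run_learning(levels, gains, noise, W, pc, gmin,
                 alpha=0.1, tau=0.5, q_init=20.0, steps=1000, seed=0):
    """Sample-based Q-learning with Boltzmann selection for all users."""
    rng = random.Random(seed)
    n, m = len(gains), len(levels)
    q = [[q_init] * m for _ in range(n)]      # Q-values per user and level
    y = [[1.0 / m] * m for _ in range(n)]     # mixed strategies, start uniform
    for _ in range(steps):
        # Steps 1-2: every user draws a power level from its strategy.
        acts = [rng.choices(range(m), weights=y[i])[0] for i in range(n)]
        p = [levels[a] for a in acts]
        for i in range(n):
            # Step 3: realised SINR and utility (zero below the target).
            interf = sum(gains[j][i] * p[j] for j in range(n) if j != i)
            g = gains[i][i] * p[i] / (interf + noise)
            u = W * math.log2(1.0 + g) / (p[i] + pc) if g >= gmin[i] else 0.0
            # Steps 4-5 (simplified): move the played level's Q-value towards u.
            q[i][acts[i]] += alpha * (u - q[i][acts[i]])
            # Step 6: Boltzmann strategy update.
            top = max(q[i])
            w = [math.exp((v - top) / tau) for v in q[i]]
            total = sum(w)
            y[i] = [v / total for v in w]
    return q, y
```

Because the random generator is seeded, two runs with the same arguments produce identical trajectories, which is convenient when comparing parameter settings.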



The learning algorithm results in a stochastic process of choosing
power levels, so we need to investigate the long-term behavior of
the learning procedure. Along the lines of the discussion in [14],
we obtain the following differential equation describing the
evolution of the Q-values:

$$\frac{dQ_i^t(p_{i,j_i})}{dt} = \begin{cases} \alpha\left(U_i(p_{i,j_i}, \mathbf{y}_{-i}^t) - Q_i^t(p_{i,j_i})\right), & \text{if } i = 0; \\ \alpha\left(U_i^t(p_{i,j_i}) - Q_i^t(p_{i,j_i})\right), & \text{otherwise.} \end{cases} \tag{8}$$

Next, we would like to express the dynamics in terms of strategies
rather than Q-values. To this end, we differentiate Eq. (7) with
respect to $t$ and use Eq. (8), which yields the set of equations
expressed by Eq. (9).
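The differentiation step can be sketched as follows for a fixed user $i$. The shorthand is introduced only for this illustration: $Q_l$ stands for $Q_i^t(p_{i,l})$, $y_l$ for $y_{i,l}^t$, and $U_l$ for the utility term that drives $Q_l$ in the Q-value dynamics.

```latex
\begin{align*}
y_j &= \frac{\exp(Q_j/\tau_i)}{\sum_{l} \exp(Q_l/\tau_i)}
  && \text{(Boltzmann rule)} \\
\frac{dy_j}{dt} &= \frac{y_j}{\tau_i}\Big(\frac{dQ_j}{dt} - \sum_{l} y_l \frac{dQ_l}{dt}\Big)
  && \text{(chain rule, } \partial y_j / \partial Q_k = y_j(\delta_{jk} - y_k)/\tau_i\text{)} \\
 &= \frac{\alpha\, y_j}{\tau_i}\Big(U_j - \sum_l y_l U_l - \big(Q_j - \sum_l y_l Q_l\big)\Big)
  && \text{(insert } dQ_l/dt = \alpha (U_l - Q_l)\text{)} \\
Q_j - \sum_l y_l Q_l &= \sum_l y_l (Q_j - Q_l) = \tau_i \sum_l y_l \ln\frac{y_j}{y_l}
  && \text{(since } \textstyle\sum_l y_l = 1,\ Q_j - Q_l = \tau_i \ln(y_j/y_l)\text{)} \\
\frac{dy_j}{dt} &= \frac{\alpha\, y_j}{\tau_i}\Big(U_j - \sum_l y_l U_l - \tau_i \sum_l y_l \ln\frac{y_j}{y_l}\Big).
\end{align*}
```

The last line has a payoff-difference part, which rewards levels that beat the user's current average utility, and an entropy-like part weighted by the temperature, which pulls the strategy back towards randomization.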

The first term in the braces of Eq. (9) asserts that the
probability of choosing power level $p_{i,j_i}$ increases at a rate
proportional to the overall efficiency of that strategy, while the
second term describes the BS's tendency to randomize over the
possible power levels. Let
$\mathbf{Y}^t = (\mathbf{y}_0^t, \ldots, \mathbf{y}_N^t)$ be the
strategy profile of all users at time $t$. In the following
analysis, we resort to an ordinary differential equation (ODE)
whose solution approximates the convergence of $\mathbf{Y}^t$ [15].
As $\alpha \to 0$, we can represent the right-hand side of Eq. (9)
by a function $f(\mathbf{Y}^t)$. With any initial
$\mathbf{Y}^0 = \mathbf{Y}_0$, $\mathbf{Y}^t$ converges weakly, as
$\alpha \to 0$, to
$\mathbf{Y}^* = (\mathbf{y}_0^*, NE(\mathbf{y}_0^*))$, which is the
solution of

$$\frac{d\mathbf{Y}}{dt} = f(\mathbf{Y}), \quad \mathbf{Y}^0 = \mathbf{Y}_0. \tag{10}$$
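The averaged behavior can be explored numerically with a deterministic iteration in which every Q-value is moved towards its exact expected utility at each step, followed by a Boltzmann strategy update. This is a sketch under a strong simplifying assumption, namely that every user can evaluate the expectation over all other users' strategies; the function name, signature, and defaults are illustrative.

```python
import math
from itertools import product

def mean_dynamics(levels, gains, noise, W, pc, gmin,
                  alpha=0.1, tau=0.5, steps=300):
    """Deterministic Q-value and strategy iteration with exact expectations."""
    n, m = len(gains), len(levels)

    def u(i, p):
        interf = sum(gains[j][i] * p[j] for j in range(n) if j != i)
        g = gains[i][i] * p[i] / (interf + noise)
        return W * math.log2(1.0 + g) / (p[i] + pc) if g >= gmin[i] else 0.0

    q = [[0.0] * m for _ in range(n)]
    y = [[1.0 / m] * m for _ in range(n)]
    for _ in range(steps):
        exp_u = [[0.0] * m for _ in range(n)]
        for joint in product(range(m), repeat=n):
            p = [levels[k] for k in joint]
            for i in range(n):
                # Probability that the other users play their part of `joint`.
                prob = math.prod(y[j][joint[j]] for j in range(n) if j != i)
                exp_u[i][joint[i]] += prob * u(i, p)
        for i in range(n):
            q[i] = [qk + alpha * (uk - qk) for qk, uk in zip(q[i], exp_u[i])]
            top = max(q[i])
            w = [math.exp((qk - top) / tau) for qk in q[i]]
            total = sum(w)
            y[i] = [wk / total for wk in w]
    return q, y
```

In a weak-interference setting where the lowest power level is the most energy-efficient for every user, the strategies returned by this iteration concentrate on that level, with a residual spread set by the temperature.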



Theorem 3. RLA-I will always discover an SE strategy. 

Proof: We prove this by contradiction. Suppose that the process
generated by Eq. (7) converges to a point that is not a Stackelberg
equilibrium. The equilibrium solutions of Eq. (10), which
characterize the long-term behavior of RLA-I, are by definition
stationary points, so RLA-I can only converge to stationary points.
The supposition therefore means that stationary points that are not
SEs are stable, which contradicts Theorem 2. ■

B. Reinforcement Learning based Algorithm-II (RLA-II) 

In this subsection, to promote cooperation among the competing
FUs, we further design a simple and intuitive rule by which each FU
links the other FUs' strategies to its own current transmission
strategy. Such a rule reflects an awareness that there are
strategic interactions during the learning procedure. FUs with such
beliefs do not correctly perceive how the future strategies of
their competitors depend on the past. Thus, we propose a conjecture
model concerning the way in which the FUs react to each other [16].

Each FU $i \in \mathcal{N} \setminus \{0\}$ believes that any
change in its current transmission strategy will induce the other
FUs to make well-defined changes in the corresponding time slot.
Specifically, we need to express FU $i$'s expected contention
measure
$\theta_i^{t+1}(\mathbf{p}_{-(0,i)}^{t+1}) = \prod_{j \in \mathcal{N} \setminus \{0,i\}} y_{j,j_j}^{t+1}$
in Eq. (4) through its conjectured value
$\tilde{\theta}_i^{t+1}(\mathbf{p}_{-(0,i)}^{t+1})$, that is,

$$\tilde{\theta}_i^{t+1}(\mathbf{p}_{-(0,i)}^{t+1}) = \theta_i^{t}(\mathbf{p}_{-(0,i)}^{t}) - \delta_i \left(y_{i,j_i}^{t+1} - y_{i,j_i}^{t}\right), \tag{11}$$

where $\delta_i > 0$ is the belief factor. In other words, every FU
$i$ believes that a change of $y_{i,j_i}^{t+1} - y_{i,j_i}^{t}$ in
its strategy at time $t+1$ will induce a change of
$\delta_i (y_{i,j_i}^{t+1} - y_{i,j_i}^{t})$ in the expected
contention measure related to the transmission strategies of the
other FUs. It is worth mentioning that although FU $i$ may be aware
that other FUs are subject to many influences on their strategies,
when making its own decision it is only concerned with the other
FUs' reactions to itself. That is, FU $i$ does not take into
account whether or not FU $j$
($j \in \mathcal{N} \setminus \{0, i\}$) might react to changes in
the transmission strategy made by FU $l$
($l \in \mathcal{N} \setminus \{0, i, j\}$).

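Among the different possibilities of capturing the expected
contention measure $\theta_i^{t+1}(\mathbf{p}_{-(0,i)}^{t+1})$, the
linear model represented in Eq. (11) is the simplest form by which
one FU can model the impact of changes in its transmission strategy
on the other competing FUs. The conjecture model deployed by the
FUs is based on the concept of reciprocity, which refers to
interaction mechanisms in which the FUs repeatedly interact when
choosing the power level. If they realize that their probability of
interacting with each other in the future is high, they will
consider their influence on the strategies of the other FUs, which
is captured in the conjecture model by the positive parameter
$\delta_i$. Otherwise, they will act myopically, which is the same
learning process as in Section IV-A.

The strategy dynamics of RLA-I, referred to as Eq. (9) in Section
IV-A, are

$$\frac{dy_{i,j_i}^t}{dt} = \begin{cases} \dfrac{\alpha\, y_{i,j_i}^t}{\tau_i} \left\{ U_i(p_{i,j_i}, \mathbf{y}_{-i}^t) - \sum_{l=1}^{m_i} y_{i,l}^t\, U_i(p_{i,l}, \mathbf{y}_{-i}^t) - \tau_i \sum_{l=1}^{m_i} y_{i,l}^t \ln\dfrac{y_{i,j_i}^t}{y_{i,l}^t} \right\}, & \text{if } i = 0; \\ \dfrac{\alpha\, y_{i,j_i}^t}{\tau_i} \left\{ U_i^t(p_{i,j_i}) - \sum_{l=1}^{m_i} y_{i,l}^t\, U_i^t(p_{i,l}) - \tau_i \sum_{l=1}^{m_i} y_{i,l}^t \ln\dfrac{y_{i,j_i}^t}{y_{i,l}^t} \right\}, & \text{otherwise.} \end{cases} \tag{9}$$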
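Under this conjecture model, Eq. (4) becomes Eq. (12), in which the
expected contention measure is replaced by its conjectured value
$\tilde{\theta}_i^{t+1}(\mathbf{p}_{-(0,i)}^{t+1})$ from Eq. (11).
We therefore propose RLA-II, which replaces $U_i^{t+1}(p_{i,j_i})$
in Eq. (6) with Eq. (12) to discover the SE strategy. RLA-II is
otherwise identical to RLA-I: only the utility estimate behind the
FUs' Q-value updates differs, Eq. (12) in RLA-II instead of Eq. (4)
in RLA-I.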

V. Numerical Results 

We compare the performance of the two learning algorithms through
simulation experiments. We consider multiple FUs coexisting with
one MU over a spectrum with a bandwidth of 1 MHz. The minimum SINR
targets of the MU and the FUs are set to 3 dB and 5 dB,
respectively. The measurement noise is zero-mean Gaussian with
power $\sigma^2 = -110$ dBm. For simplicity, we suppose that the
belief factors $\delta_i$ are all equal to 2,
$\forall i \in \mathcal{N} \setminus \{0\}$, i.e., the FUs have the
same conjecture ability, and the additional circuit power
consumption is 10 dBm for all users. The femtocells are assumed to
be uniformly distributed within a circle centered at the MBS with a
radius of 500 m. The coverage radius of each femtocell is 20 m. The
channel gains are generated by a log-normal shadowing path loss
model, $h_{i,j} = d_{i,j}^{-n}$, where $d_{i,j}$ is the distance
between user $i$ and BS $j$, and $n$ is the path loss exponent. In
the simulation, $n$ is assumed to be 4. The action set of
transmission power levels for all users is {20, 25, 30} dBm.
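For readers who wish to reproduce a setup of this kind, the following sketch converts the quoted dBm and dB values to linear units and generates channel gains from distances. It assumes a pure distance-based gain with the stated path loss exponent and no shadowing term, and it places points uniformly over a disc by the square-root radius rule; the helper names are hypothetical.

```python
import math
import random

def dbm_to_watt(dbm):
    """Convert a power level from dBm to watts."""
    return 10 ** ((dbm - 30) / 10)

def db_to_linear(db):
    """Convert a ratio (for example an SINR target) from dB to linear scale."""
    return 10 ** (db / 10)

def path_loss_gain(distance_m, exponent=4.0):
    """Distance-based channel gain; shadowing is omitted in this sketch."""
    return distance_m ** (-exponent)

def uniform_point_in_disc(radius, rng):
    """Point drawn uniformly over the area of a disc centred at the origin."""
    r = radius * math.sqrt(rng.random())
    phi = 2.0 * math.pi * rng.random()
    return r * math.cos(phi), r * math.sin(phi)

# Values quoted in the text, expressed in linear units.
power_levels_w = [dbm_to_watt(p) for p in (20, 25, 30)]
noise_w = dbm_to_watt(-110)
circuit_power_w = dbm_to_watt(10)
```

Working in linear units throughout avoids mixing dB quantities into the SINR and energy-efficiency formulas, where powers are added rather than multiplied.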

Fig. 2, Fig. 3, and Fig. 4 show the learning process of the
expected utilities for each user in the network. The results are
compared with

1) The fully cooperative power allocation game with complete
information: each user knows all the utility functions and
transmission power levels in the network, so the optimal utilities
in the power allocation process can be calculated by each user
according to Eq. (1). This scenario is equivalent to the classic
power control game in femtocell networks without the pricing
schemes from macrocells, as shown in [3].

2) The non-cooperative learning process of the power control game
without any information exchange: each user's transmission
decisions in the learning process are self-interested, following
the myopic best response correspondence, which is the scenario
discussed in [10].

The first observation from our simulation results is that,
whenever we generate random initial probability distributions of
the power levels, the equilibrium state of the transmission
strategies achieved by all users is independent of these initial
values. That is, there exists an SE in the Stackelberg learning
game, which confirms Theorem 2.

Fig. 2. Learning process of the expected utilities for FU 1
(curves: non-cooperative, RLA-I, RLA-II, with complete information;
horizontal axis: time slot t).

Fig. 3. Learning process of the expected utilities for FU 2.

Secondly, the curves show that the expected utilities of all users
in the learning process finally converge to (or approach) the
equilibrium point of the complete cooperation case, and these
simulation results validate the conclusion of Theorem 3. In
addition, both proposed reinforcement learning schemes outperform
the non-cooperative case. This is because, in a Stackelberg
learning game, knowing more can improve not only the leader's (MU)
own utility but also the utilities of the followers (FUs).
Meanwhile, RLA-II achieves better performance than RLA-I, because
all FUs have the incentive to achieve better utilities and thus
behave reciprocally to some extent.

Fig. 4. Learning process of the expected utilities for MU.

Fig. 5. The expected SINRs for FUs versus $\gamma_0^*$.

Fig. 5 shows the expected SINRs of the FUs using RLA-I and RLA-II,
respectively, versus the minimum macrocell QoS requirement
$\gamma_0^*$. For the same $\gamma_0^*$, the expected SINRs of the
FUs with RLA-II are in general higher than those with RLA-I, in
accordance with our previous discussion. It is also worth
mentioning that when $\gamma_0^*$ is sufficiently large, the
expected SINRs of the FUs approach zero for both learning
algorithms, because no femtocell is then active in the network.

VI. Conclusion 

In this paper, energy efficiency is investigated for the uplink
transmission in a spectrum-sharing-based two-tier femtocell network
using stochastic learning theory. The Stackelberg learning
framework is adopted to jointly study the utility maximization of
the MU and the FUs. Based on reinforcement learning, we propose two
algorithms, namely RLA-I and RLA-II, whose convergence is proven
theoretically. Numerical experiments illustrate that the
reciprocity-inspired RLA-II converges more quickly and achieves
better performance than RLA-I and the non-cooperative learning
scheme. This comes at the expense of more side information at the
FUs. In conclusion, both learning algorithms show potential for
improving the energy efficiency of femtocell networks.